AITopics | exam question

117c5c8622b0d539f74f6d1fb082a2e9-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Apr-25-2026, 00:47:32 GMT

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Education > Assessment & Standards (0.68)
Education > Educational Setting > K-12 Education > Secondary School (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

117c5c8622b0d539f74f6d1fb082a2e9-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Feb-8-2026, 00:24:08 GMT

dataset, evaluation, llm, (15 more...)

Neural Information Processing Systems

Country:

Asia > Thailand (0.05)
Africa > Kenya (0.04)
Asia > China > Beijing > Beijing (0.04)
(12 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine (0.67)
Education > Assessment & Standards (0.67)
Education > Educational Setting > K-12 Education > Secondary School (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

Neural Information Processing Systems | Dec-23-2025, 22:57:00 GMT

Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development.

m3exam, multilevel benchmark, multilingual, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty

Zotos, Leonidas, de Jong, Ivo Pascal, Valdenegro-Toro, Matias, Sburlea, Andreea Ioana, Nissim, Malvina, van Rijn, Hedderik

arXiv.org Artificial Intelligence | Nov-18-2025

Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.
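As a rough illustration of the supervised setup described above, the sketch below turns a model's answer probability on a True/False item into an uncertainty score and fits a one-feature least-squares line that maps uncertainty to the share of students answering correctly. The function names, the use of binary entropy as the uncertainty measure, and the linear model are assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List, Tuple


def binary_entropy(p_true: float) -> float:
    """Entropy in bits of a True/False answer distribution (assumed uncertainty measure)."""
    if p_true <= 0.0 or p_true >= 1.0:
        return 0.0  # a fully confident answer carries no uncertainty
    q = 1.0 - p_true
    return -(p_true * math.log2(p_true) + q * math.log2(q))


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, x: float) -> float:
    """Predicted share of students answering correctly for uncertainty x."""
    return slope * x + intercept
```

In use, `xs` would hold one uncertainty value per training question and `ys` the observed share of correct student answers; a small training set such as the 42 samples mentioned above is enough to fit a model this simple.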

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2508.03294

Country:

North America > Mexico (0.29)
Europe > Italy (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy

Kumar, Ramya, Gulwani, Dhruv, Singh, Sonit

arXiv.org Artificial Intelligence | Nov-17-2025

This paper explores the automatic classification of exam questions and learning outcomes according to Bloom's Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories (Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation) was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (for example, synonym replacement and word embeddings). Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 scores with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially avoided it but began to show signs of overfitting as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom's Taxonomy classification.
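Synonym replacement, one of the augmentation strategies named above, can be sketched in a few lines. The function name, the tiny synonym table, and the seeding scheme are hypothetical; the abstract does not say which lexicon or replacement rate the authors used.

```python
import random
from typing import Dict, List


def synonym_replace(sentence: str, synonyms: Dict[str, List[str]],
                    n: int = 1, seed: int = 0) -> str:
    """Replace up to n words that have an entry in `synonyms` (keys are lowercase)."""
    words = sentence.split()
    rng = random.Random(seed)
    # Positions of words that can be replaced at all.
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        # The replacement is inserted as written in the table; original casing is not kept.
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


# Hypothetical one-entry lexicon, for illustration only.
augmented = synonym_replace("Describe the main steps", {"describe": ["explain"]})
```

Each augmented sentence keeps the label of the sentence it was derived from, which is how a 600-sentence dataset can be enlarged before training a classifier such as an SVM.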

classification, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2511.10903

Country:

Asia (0.46)
Europe (0.46)
Oceania > Australia (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Setting > Online (0.46)
Education > Educational Technology > Educational Software (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.55)

Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study

Isley, Calvin, Gilbert, Joshua, Kassos, Evangelos, Kocher, Michaela, Nie, Allen, Brunskill, Emma, Domingue, Ben, Hofman, Jake, Legewie, Joscha, Svoronos, Teddy, Tuminelli, Charlotte, Goel, Sharad

arXiv.org Artificial Intelligence | Aug-13-2025

While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement about automatically generating questions using artificial intelligence, but comparatively little work evaluating the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI's role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions in a large-scale field study involving 91 classes (covering computer science, mathematics, chemistry, and more) in dozens of colleges across the United States, comprising nearly 1700 students. Our analysis, based on item response theory (IRT), suggests that for students in our sample the AI-generated questions performed comparably to expert-created questions designed for standardized exams. Our results illustrate the power of AI to make high-quality assessments more readily available, benefiting both teachers and students.
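For readers unfamiliar with item response theory, the sketch below shows the standard two-parameter logistic (2PL) item model, in which the probability of a correct answer depends on student ability and on the item's discrimination and difficulty. The abstract does not say which IRT model the study fitted, so the choice of 2PL and the function names are assumptions.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct | ability theta).

    a is the item's discrimination, b its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct_2pl(theta, a, b)
    return a * a * p * (1.0 - p)
```

Comparing fitted discrimination and information values across items is one common way to judge whether a set of questions measures ability as well as a reference set does.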

exam, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.08314

Country: North America > United States > California > Santa Clara County (0.14)

Genre:

Research Report > New Finding (1.00)
Instructional Material > Course Syllabus & Notes (0.88)

Industry: Education > Educational Setting > Higher Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)

Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam

Truyts, Cesar Augusto Madid, Rabelo, Amanda Gomes, de Souza, Gabriel Mesquita, Lages, Daniel Scaldaferri, Pereira, Adriano Jose, Flato, Uri Adrian Prync, Reis, Eduardo Pontes dos, Vieira, Joaquim Edson, Silveira, Paulo Sergio Panse, Junior, Edson Amaro

arXiv.org Artificial Intelligence | Jul-29-2025

Artificial intelligence (AI) has shown the potential to revolutionize healthcare by improving diagnostic accuracy, optimizing workflows, and personalizing treatment plans. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved notable advancements in natural language processing and medical applications. However, the evaluation of these models has focused predominantly on the English language, leading to potential biases in their performance across different languages. This study investigates the capability of six LLMs (GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8x7B Instruct, Titan Text G1-Express, and Command R+) and four MLLMs (Claude-3.5-Sonnet, Claude-3-Opus, Claude-3-Sonnet, and Claude-3-Haiku) to answer questions written in Brazilian Portuguese from the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP), the largest health complex in South America. The performance of the models was benchmarked against human candidates, analyzing accuracy, processing time, and coherence of the generated explanations. The results show that while some models, particularly Claude-3.5-Sonnet and Claude-3-Opus, achieved accuracy levels comparable to human candidates, performance gaps persist, particularly in multimodal questions requiring image interpretation. Furthermore, the study highlights language disparities, emphasizing the need for further fine-tuning and data set augmentation for non-English medical AI applications. Our findings reinforce the importance of evaluating generative AI in various linguistic and clinical settings to ensure a fair and reliable deployment in healthcare. Future research should explore improved training methodologies, improved multimodal reasoning, and real-world clinical integration of AI-driven medical assistance.

accuracy, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.19885

Country: South America > Brazil > São Paulo (0.24)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.70)

Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models

Ismailov, Altynbek, Asanova, Salia

arXiv.org Artificial Intelligence | Jul-23-2025

Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification deleting 10% of words per step; (ii) lexical flip swapping a pivotal quantifier ("max" to "min"); and (iii) jargon inflation replacing a common noun with an obscure technical synonym. Six frontier models, including three "reasoning-tuned" versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
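Two of the perturbations described above are simple enough to sketch directly. The version below is only an illustration under stated assumptions: the function names are hypothetical, deletion positions are drawn from one fixed random order so that each step removes a superset of the previous step's words, and the flip swaps the first whole-word match. The abstract does not say how the authors chose which words to delete.

```python
import random
from typing import List


def progressive_underspecification(prompt: str, step: float = 0.10,
                                   n_steps: int = 9, seed: int = 0) -> List[str]:
    """Return n_steps variants; variant k has about k * step of the words deleted."""
    words = prompt.split()
    order = list(range(len(words)))
    random.Random(seed).shuffle(order)  # fixed deletion order (assumption)
    variants = []
    for k in range(1, n_steps + 1):
        n_drop = round(len(words) * step * k)
        dropped = set(order[:n_drop])
        variants.append(" ".join(w for i, w in enumerate(words) if i not in dropped))
    return variants


def lexical_flip(prompt: str, old: str = "max", new: str = "min") -> str:
    """Swap the first whole-word occurrence of `old` for `new`."""
    words = prompt.split()
    for i, w in enumerate(words):
        if w == old:
            words[i] = new
            break
    return " ".join(words)
```

With the defaults, the last variant keeps roughly 10% of the original words, which corresponds to the "90% of the prompt is missing" condition mentioned above.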

exam question, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.15868

Genre: Research Report > Experimental Study (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

State Bar of California admits it used AI to develop exam questions, triggering new furor

Los Angeles Times | Apr-23-2025, 10:00:53 GMT

Nearly two months after hundreds of prospective California lawyers complained that their bar exams were plagued with technical problems and irregularities, the state's legal licensing body has caused fresh outrage by admitting that some multiple-choice questions were developed with the aid of artificial intelligence. The State Bar of California said in a news release Monday that it will ask the California Supreme Court to adjust test scores for those who took its February bar exam. But it declined to acknowledge significant problems with its multiple-choice questions -- even as it revealed that a subset of questions were recycled from a first-year law student exam, while others were developed with the assistance of AI by ACS Ventures, the State Bar's independent psychometrician. "The debacle that was the February 2025 bar exam is worse than we imagined," said Mary Basick, assistant dean of academic skills at UC Irvine Law School. Having the questions drafted by non-lawyers using ...

artificial intelligence, press release, state bar, (12 more...)

Los Angeles Times

Country: North America > United States > California (1.00)

Genre: Press Release (0.35)

Industry:

Law > Government & the Courts (0.73)
Education > Educational Setting > Higher Education (0.70)
Education > Curriculum > Subject-Specific Education (0.70)
Government > Regional Government > North America Government > United States Government (0.36)

Technology: Information Technology > Artificial Intelligence > Applied AI (0.86)

What is a Good Question? Utility Estimation with LLM-based Simulations

Lee, Dong-Ho, Cho, Hyundong, May, Jonathan, Pujara, Jay

arXiv.org Artificial Intelligence | Feb-24-2025

Asking questions is a fundamental aspect of learning that facilitates deeper understanding. However, characterizing and crafting questions that effectively improve learning remains elusive. To address this gap, we propose QUEST (Question Utility Estimation with Simulated Tests). QUEST simulates a learning environment that enables the quantification of a question's utility based on its direct impact on improving learning outcomes. Furthermore, we can identify high-utility questions and use them to fine-tune question generation models with rejection sampling. We find that questions generated by models trained with rejection sampling based on question utility result in exam scores that are higher by at least 20% than those from specialized prompting grounded on educational objectives literature and models fine-tuned with indirect measures of question quality, such as saliency and expected information gain.
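The filtering step behind rejection sampling can be sketched as follows: candidate questions are scored by a utility function and only those at or above a threshold are kept as fine-tuning data. The function name, the threshold rule, and the toy utility table are hypothetical; in QUEST the utility comes from simulated exam outcomes, which this sketch does not reproduce.

```python
from typing import Callable, List, Tuple


def rejection_sample(candidates: List[str],
                     utility: Callable[[str], float],
                     threshold: float) -> List[Tuple[str, float]]:
    """Keep candidates whose utility is at least `threshold`, in their original order."""
    scored = [(q, utility(q)) for q in candidates]
    return [(q, u) for q, u in scored if u >= threshold]


# Toy utility scores standing in for a simulated learner's exam gain (made-up values).
toy_utility = {"q1": 0.2, "q2": 0.9, "q3": 0.5}
kept = rejection_sample(["q1", "q2", "q3"], toy_utility.get, threshold=0.5)
```

The retained pairs would then serve as training examples for the question generator, so that the model is tuned only on questions the simulation rated as useful.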

computational linguistic, information, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2502.17383

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California (0.14)
North America > Mexico > Mexico City > Mexico City (0.04)
(10 more...)

Genre:

Research Report > New Finding (1.00)
Instructional Material (0.88)

Industry: Education > Instructional Theory (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
